Introduction


Project Introduction

In this project, I focused on exploring the data from a business intelligence standpoint.

In the first part, I analyzed the TV, App, and Website datasets separately and summarized my findings in tables and plots. I also explored the connections among the three datasets to see what actionable insights we can gain. Each analysis begins with a question, followed by the approach and data visualizations, and ends with my conclusions and recommendations.

In the second part, I conducted audience segmentation using the TV data, brainstormed how segmentation could be improved if more data were available, and gave marketing suggestions.

Lastly, I listed what I would like to do next to explore the data further.


Brief Data Introduction

The TV dataset contains audience behavior information, including view time, TV programs, and network, for each user in the week of 2017-01-02. It originally contains 1062961 observations and 30 variables. After data cleaning and manipulation, such as removing unreasonable records, converting data types, and removing duplicate or useless columns, it contains 531172 observations and 26 variables (user id included).

The App dataset contains app usage information, including App name, device used, and time spent on each App, for each user on 2017-01-02. It originally contains 797563 observations and 8 variables. After removing duplicate or useless columns and unreasonable records, it contains 796545 observations and 6 variables (user id included).

The Web dataset contains website usage information, including website name, device used, and time spent on each website, for each user on 2017-01-02. It originally contains 33521 observations and 8 variables. Records with TOTAL_MINUTES less than 1 are removed from the dataset. After removing duplicate or useless columns and unreasonable records, it contains 15168 observations and 7 variables (user id included).


Exploratory Data Analysis and Data Understanding


TV Usage

How are audiences distributed by day of week?

I first wanted to see whether the number of TV viewers and the watching time differ by day of week. The number of unique viewers, the total time spent (in hrs), and the average time spent per viewer (in hrs) on each day are calculated and summarized in the following table. I expected Friday and the weekend to have more viewers and a higher average time spent on TV than any day from Monday to Thursday.

Please note that the number of TV viewers is calculated by counting unique user ids, so that no user is counted twice.

From the above table we can see that in the week of 2017-01-02, the number of viewers is very similar on each day, at around 12k. Thursday, Sunday, and Friday have the most viewers.

The total hours spent on TV range from around 25k to 35k per day, with Sunday (34907 hours) and Saturday (31565 hours) being the highest and second highest.

Sunday and Saturday also had the top two average view times: on average, a viewer spent 2.85 and 2.62 hours on Sunday and Saturday, respectively.

People watched TV about 30 minutes more per day on weekends (2.73 hrs) than on weekdays (2.23 hrs).

To my surprise, Monday has the third most total hours spent (30531.14 hours) and also the third highest average view time, 2.53 hours.
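The per-day summary described in this section can be sketched as follows. This is a minimal illustration, not the code used for the report: the record layout (user id, day of week, hours watched) and the sample values are assumptions. It divides total hours by the number of unique users, which is consistent with the reported figures (for example, 34907 hours at 2.85 hours per viewer implies roughly 12k viewers).

```python
from collections import defaultdict

# Hypothetical records: (user_id, day_of_week, hours_watched).
records = [
    ("u1", "MON", 1.5),
    ("u1", "MON", 0.5),
    ("u2", "MON", 3.0),
    ("u1", "SAT", 2.0),
    ("u3", "SAT", 4.0),
]


def summarize_by_day(rows):
    """Return {day: (unique_users, total_hours, avg_hours_per_user)}."""
    users = defaultdict(set)
    hours = defaultdict(float)
    for user_id, day, watched in rows:
        users[day].add(user_id)  # a set counts each user once per day
        hours[day] += watched
    return {
        day: (len(users[day]), hours[day], hours[day] / len(users[day]))
        for day in users
    }


summary = summarize_by_day(records)
for day, (n_users, total, avg) in summary.items():
    print(f"{day}: {n_users} users, {total:.2f} total hrs, {avg:.2f} avg hrs")
```

Collecting user ids in a set is what prevents a user with several viewing records on the same day from being counted more than once.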


How are LIVE audiences distributed by air day part?

To see how audience behavior differs by the part of day a show is on air, we need to know when audiences watched. However, since the dataset has no exact view time, I only analyzed LIVE audiences, whose view time can be inferred from the air time.

I first created a subset for LIVE audiences and then counted the number of users in each AIR_DAY_PART.
Please note that the number of TV viewers is calculated by counting unique user ids, so that no user is counted twice.

We can see that most audiences watch Live TV during Prime time (8pm - 11pm), followed by Daytime (9am - 3pm) and Early Fringe (3pm - 5pm). Overnight has the fewest viewers of any day part.


How are LIVE audiences distributed by air day part on weekdays and weekends?

If we calculate the average number of LIVE viewers on weekdays and weekends separately, as below, we can see that although prime hours were the peak TV viewing times on both weekend days and weekdays, the number of viewers was greater on weekend days beginning from 9PM and continuing until 5AM. Put another way, the number of viewers on weekdays is greater than that on weekends only during Early Morning.
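The weekday versus weekend comparison can be sketched like this. The record layout, the day codes, and the sample rows are assumptions for illustration. Treating only Saturday and Sunday as the weekend is also an assumption, although it matches the weekend average reported earlier. Note that this sketch averages only over days on which a day part had at least one viewer.

```python
from collections import defaultdict

WEEKEND = {"SAT", "SUN"}  # assumption: Friday counts as a weekday

# Hypothetical LIVE records: (user_id, day_of_week, air_day_part).
live_rows = [
    ("u1", "MON", "Prime"), ("u2", "MON", "Prime"),
    ("u1", "TUE", "Prime"),
    ("u1", "SAT", "Prime"), ("u2", "SAT", "Prime"), ("u3", "SAT", "Prime"),
    ("u2", "SUN", "Prime"),
    ("u3", "MON", "Early Morning"),
]


def avg_audience_by_day_part(rows):
    """Average number of unique users per day for each (day type, day part)."""
    users = defaultdict(set)  # (day_type, day, part) -> user ids seen
    for user_id, day, part in rows:
        day_type = "weekend" if day in WEEKEND else "weekday"
        users[(day_type, day, part)].add(user_id)

    daily_counts = defaultdict(list)  # (day_type, part) -> one count per day
    for (day_type, _day, part), ids in users.items():
        daily_counts[(day_type, part)].append(len(ids))

    return {key: sum(c) / len(c) for key, c in daily_counts.items()}


averages = avg_audience_by_day_part(live_rows)
for (day_type, part), value in sorted(averages.items()):
    print(f"{day_type:8s} {part:14s} {value:.2f}")
```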


What is the distribution of video view platforms?

Today’s program content is viewed on more than just television sets. Consumers are watching via the Internet and on mobile devices, in-home and out-of-home, live and time-shifted. Therefore, I would like to see how video view platform is distributed among our audience.

To approach this problem, I grouped the data by video platform category and then counted the number of unique users, identified by unique USERS_META_ID.

I chose a pie chart here because it shows both the proportion and the number of users in each group, which makes the 4 groups easy to compare. We can see that around 60% of audiences watched TV through either Live or DVR.


App Usage


How much time is spent on different categories of Apps?

To approach this problem, I calculated the total time spent on each App category and then each category's proportion of the total time. There are 17 categories, such as Food&Drink, Family&Kids, and Educational, that took up less than 5% of the total time; they are therefore grouped as ‘Others’.

The following pie chart shows the proportion of the total time spent in each App category.

From the above pie chart we can see that Social (18.5%) and Entertainment&Lifestyle (13.8%) are the major App categories, followed by Email&Communication, Misc, and Browser.
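The proportion calculation with an ‘Others’ bucket can be sketched as follows. The input layout and the sample minutes are assumptions, and the sketch assumes the 5% cut-off is applied to each category individually.

```python
from collections import defaultdict


def category_shares(rows, threshold=0.05):
    """Share of total time per category; small categories go to 'Others'."""
    totals = defaultdict(float)
    for category, minutes in rows:
        totals[category] += minutes
    grand_total = sum(totals.values())

    shares = defaultdict(float)
    for category, minutes in totals.items():
        share = minutes / grand_total
        # Assumption: a category is folded into 'Others' when its own
        # share is below the threshold.
        key = category if share >= threshold else "Others"
        shares[key] += share
    return dict(shares)


# Hypothetical (category, minutes) records.
usage = [
    ("Social", 50),
    ("Entertainment&Lifestyle", 30),
    ("Browser", 16),
    ("Food&Drink", 2),
    ("Educational", 2),
]
shares = category_shares(usage)
for category, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{category}: {share:.1%}")
```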


Among Entertainment&Lifestyle, what are the major Apps used?


I paid extra attention to Entertainment&Lifestyle Apps because this category contains streaming TV services and video platforms such as Youtube, Netflix, Hulu, Roku, Hbo Go and so on. Why is this important?

As the Los Angeles Times put it, ‘while the majority of viewers watch the old-fashioned way — live and seated in front of a TV screen — new technologies are rapidly transforming the way programming is consumed. The upending of television is being led by digital video recorders, video on-demand and streaming sites such as Netflix, Hulu and Amazon that can be watched on mobile phones and tablets as well’.[https://www.latimes.com/entertainment/tv/la-et-st-tv-section-ratings-20141123-story.html]

This indicates that as more audiences watch shows on their mobile devices, it becomes harder to track users’ watching behavior and measure their content consumption, and therefore more challenging to create a complete user profile.

Therefore, I looked at App consumption within Entertainment&Lifestyle to see whether our users spent much time on streaming services on their mobile devices.



There are many Apps within Entertainment&Lifestyle. Here I pick only a few popular streaming services and look at the time users spent on them.

On 2017-01-02,

  • 3163 users used Youtube and spent 21 minutes on it on average.
  • 555 users used Netflix and spent 23 minutes on it on average.
  • 148 users used Hulu Plus and spent 14 minutes on it on average.
  • 53 users used Youtube Kids and spent 38 minutes on it on average.
  • 80 users used Roku and spent 18 minutes on it on average.
  • 44 users used Entertainow Tv Mobile and spent 41 minutes on it on average.
  • 45 users used Lifestylz.tv and spent 36 minutes on it on average.


This is not a complete list of streaming services and video platforms, but it already shows that many users spent considerable time on them, taking market share from traditional TV viewing.

Cross-platform tracking makes it possible to get more accurate statistics and more comprehensive information about users, since their identities are no longer split across multiple platforms (cable TV, streaming services, etc.). In our case, if we had audience view information such as program information, view time, and elapsed time collected from streaming TV Apps and video platform Apps, we would be able to better analyze users’ viewing behavior and create customized marketing strategies for different audiences, e.g. recommending customized TV programs and sending promotional advertisements.


Website Usage


How is Website Usage distributed by Day Part?



Note: Time is in minutes.

Most users spent time on websites between 9AM and 5PM, and on average each user also spent the most time during this period.


Is there any difference between web usage and TV usage on different day parts?

The results of the above analysis made me wonder whether web usage and TV usage differ across day parts. Therefore, I created bar charts to compare the two below:

From the above bar chart we can see that on 2017-01-02, our participants who watched LIVE TV spent much more time on TV than on websites in all day parts. Daytime is the most popular day part for both TV viewers and website users.
Please note that our website usage data covers only 2017-01-02, so the above conclusion can be very biased and may not be applicable to other times.


How was time consumed on different websites?


To answer this question, I calculated the total time spent, number of users, and average time spent on each website.

  • Facebook, Google, Amazon, Craigslist and Youtube are the top 5 websites browsed in terms of total time.
  • Google, Facebook, Amazon, Wiki, and Perksplus are the top 5 websites that have the most users.
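The two rankings above come from the same per-website summary. A minimal sketch of that summary follows; the record layout and the sample values are assumptions for illustration.

```python
from collections import defaultdict


def site_stats(rows):
    """Per-site total minutes, unique users, and average minutes per user."""
    minutes = defaultdict(float)
    users = defaultdict(set)
    for user_id, site, spent in rows:
        minutes[site] += spent
        users[site].add(user_id)
    return {
        site: {
            "total": minutes[site],
            "users": len(users[site]),
            "avg": minutes[site] / len(users[site]),
        }
        for site in minutes
    }


def top_n(stats, key, n=5):
    """Sites ranked by the given statistic, highest first."""
    return sorted(stats, key=lambda site: stats[site][key], reverse=True)[:n]


# Hypothetical (user_id, site, minutes) records.
visits = [
    ("u1", "facebook", 30), ("u2", "facebook", 10),
    ("u1", "google", 5), ("u2", "google", 5), ("u3", "google", 5),
    ("u3", "amazon", 12),
]
stats = site_stats(visits)
print("by total time:", top_n(stats, "total", n=2))
print("by users:     ", top_n(stats, "users", n=2))
```

Ranking by total time and ranking by number of users can disagree, as in the sample data here, which is why the report lists both.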

Our data is limited in that we only know which websites users browsed but NOT the content they looked at (I am not sure whether collecting that would be legal). For example, if we knew that a user checked Twitter and read discussions about a recent movie, it would be likely that this user is interested in the movie.


Audience Segmentation


One example from TV data


Given that at Viacom the focus of our business is to engage global audiences and deliver compelling content to our fans across all platforms, audience segmentation deserves close attention. In this project, I measured audience behavior by program info from the TV data. To be more specific, I selected AIR_DAY_PART_DESC, AIR_DOW, TIMESHIFT_INDICATOR_DESC, MASTER_GENRE_DESC, VIDEO_VIEW_PLATFORM, SYNDICATION_GROUP, and PROGRAM_NAME from the TV data, as well as a new feature: the percentage of a show that was watched (ELAPSED_TIME/SHOW_DURATION). Audiences were clustered into the same group only if all of these features are the same.

Due to limited time, I only picked one cluster here as an example:
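The exact-match grouping described above can be sketched as follows. This is an illustration under assumptions, not the original code: the sample rows are made up, ELAPSED_TIME and SHOW_DURATION are assumed to be in the same unit, and rounding the watched share to one decimal place is an assumption.

```python
from collections import defaultdict

FEATURES = (
    "AIR_DAY_PART_DESC", "AIR_DOW", "TIMESHIFT_INDICATOR_DESC",
    "MASTER_GENRE_DESC", "VIDEO_VIEW_PLATFORM", "SYNDICATION_GROUP",
    "PROGRAM_NAME",
)


def watched_share(elapsed, duration, ndigits=1):
    """Fraction of the show watched, rounded (rounding is an assumption)."""
    return round(elapsed / duration, ndigits)


def cluster_exact(rows):
    """Group user ids whose feature values and watched share all match."""
    clusters = defaultdict(set)
    for row in rows:
        key = tuple(row[f] for f in FEATURES)
        key += (watched_share(row["ELAPSED_TIME"], row["SHOW_DURATION"]),)
        clusters[key].add(row["USERS_META_ID"])
    return dict(clusters)


# Hypothetical rows; the user ids 1-3 are placeholders.
base = {
    "AIR_DAY_PART_DESC": "Prime (8PM-11PM)", "AIR_DOW": "MON",
    "TIMESHIFT_INDICATOR_DESC": "On demand", "MASTER_GENRE_DESC": "Drama",
    "VIDEO_VIEW_PLATFORM": "OTT", "SYNDICATION_GROUP": "Broadcast Network",
    "PROGRAM_NAME": "Grey's Anatomy", "SHOW_DURATION": 60,
}
rows = [
    {**base, "USERS_META_ID": 1, "ELAPSED_TIME": 54},
    {**base, "USERS_META_ID": 2, "ELAPSED_TIME": 55},
    {**base, "USERS_META_ID": 3, "ELAPSED_TIME": 54, "AIR_DOW": "TUE"},
]
clusters = cluster_exact(rows)
for key, user_ids in clusters.items():
    print(sorted(user_ids), key)
```

Because every feature must match exactly, this approach tends to produce many small groups; a single differing value, such as the air day for the third row, puts a user in a separate cluster.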

##   USERS_META_ID AIR_DAY_PART_DESC AIR_DOW TIMESHIFT_INDICATOR_DESC
## 1       2120025  Prime (8PM-11PM)     MON                On demand
## 2       2331905  Prime (8PM-11PM)     MON                On demand
## 3        208481  Prime (8PM-11PM)     MON                On demand
## 4       2367253  Prime (8PM-11PM)     MON                On demand
## 5       2326427  Prime (8PM-11PM)     MON                On demand
## 6       2260066  Prime (8PM-11PM)     MON                On demand
##   MASTER_GENRE_DESC VIDEO_VIEW_PLATFORM SYNDICATION_GROUP   PROGRAM_NAME
## 1             Drama                 OTT Broadcast Network Grey's Anatomy
## 2             Drama                 OTT Broadcast Network Grey's Anatomy
## 3             Drama                 OTT Broadcast Network Grey's Anatomy
## 4             Drama                 OTT Broadcast Network Grey's Anatomy
## 5             Drama                 OTT Broadcast Network Grey's Anatomy
## 6             Drama                 OTT Broadcast Network Grey's Anatomy
##     p Cluster_ID
## 1 0.9     128020
## 2 0.9     128020
## 3 0.9     128020
## 4 0.9     128020
## 5 0.9     128020
## 6 0.9     128020


We see that the above users all watched Grey’s Anatomy on streaming platforms during 8PM-11PM and finished 90% of the show. This might indicate that these users have similar watching behavior, but it could be a coincidence, since Grey’s Anatomy is very popular and many people watch TV during Prime time. Let’s pick two users, 2120025 and 2367253, and check whether they also watched other shows within the same genres, to see if they share similar tastes in TV.

  • USERS_META_ID = 2120025:


Genre                       Records
Action/ Adventure/ SciFi          1
Art/Music                         1
Awards & Specials                 0
Children                          0
Comedy                           13
Documentary                       0
Drama                            13
Game show                         0
Instruction/ Information          0
News                              0
Other                             0
Reality                           0
Spanish Language                  0
Sports                            0
Sports talk                       0
Talk/Variety                      0

Platform                    Records
DVR                               0
LIVE                             12
OTT                              16
VOD                               0

From the above tables, we can see that user 2120025 watched drama 13 times and comedy 13 times. Out of 28 records, s/he watched TV on streaming services (OTT) 16 times and LIVE 12 times.

  • USERS_META_ID = 2367253:


Genre                       Records
Action/ Adventure/ SciFi          2
Art/Music                         0
Awards & Specials                 0
Children                          0
Comedy                            6
Documentary                       0
Drama                            14
Game show                         0
Instruction/ Information          0
News                              0
Other                             0
Reality                           5
Spanish Language                  0
Sports                            0
Sports talk                       0
Talk/Variety                      0

Platform                    Records
DVR                               1
LIVE                              7
OTT                              19
VOD                               0

From the above tables, we can see that user 2367253 watched drama 14 times and comedy 6 times. Out of 27 records, s/he watched TV on streaming services (OTT) 19 times, LIVE 7 times, and DVR once.

Comparing the genres and video platforms of these 2 users, we see that both prefer watching dramas and comedies through streaming platforms. Therefore, one marketing recommendation is to recommend content that user 2367253 watched to user 2120025 and vice versa, because they might like similar TV programs. Since both of them watched more on streaming platforms, promotions for streaming services can be considered for both.

We cannot say with certainty that two people who watched the same TV program and have similar genre preferences have exactly the same taste or watching behavior, because there are many other factors to consider. However, this can serve as a reference for audience segmentation and for creating further marketing strategies.
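To make the comparison of the two users a little more concrete, the genre and platform counts listed above can be turned into shares. The sketch below uses only the non-zero counts from those tables; the choice of "drama plus comedy share" and "OTT share" as summary measures is my own illustration, not part of the original analysis.

```python
# Non-zero record counts taken from the tables for the two users.
genres = {
    2120025: {"Action/ Adventure/ SciFi": 1, "Art/Music": 1,
              "Comedy": 13, "Drama": 13},
    2367253: {"Action/ Adventure/ SciFi": 2, "Comedy": 6,
              "Drama": 14, "Reality": 5},
}
platforms = {
    2120025: {"LIVE": 12, "OTT": 16},
    2367253: {"DVR": 1, "LIVE": 7, "OTT": 19},
}


def share(counts, keys):
    """Fraction of all records that fall into the given keys."""
    total = sum(counts.values())
    return sum(counts.get(k, 0) for k in keys) / total


profile = {
    uid: (share(genres[uid], ["Drama", "Comedy"]),
          share(platforms[uid], ["OTT"]))
    for uid in genres
}
for uid, (drama_comedy, ott) in profile.items():
    print(f"{uid}: drama+comedy {drama_comedy:.1%}, OTT {ott:.1%}")
```

With these counts, drama and comedy make up about 93% of the records for user 2120025 and about 74% for user 2367253, and OTT accounts for about 57% and 70% of their records, respectively.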


Clustering methods tried for segmentation


  1. I have also tried clustering models for mixed data type using partitioning around medoids (PAM).

Partitioning around medoids is an iterative clustering procedure with the following steps:[https://www.r-bloggers.com/clustering-mixed-data-types-in-r/]

  • Choose k random entities to become the medoids
  • Assign every entity to its closest medoid (using our custom distance matrix in this case)
  • For each cluster, identify the observation that would yield the lowest average distance if it were to be re-assigned as the medoid. If so, make this observation the new medoid.
  • If at least one medoid has changed, return to step 2. Otherwise, end the algorithm.
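The four steps above can be sketched in plain Python. This is a small illustration of the listed procedure, not the R implementation I ran. The simple matching distance (the share of features on which two rows differ) stands in for the custom distance matrix and is an assumption, as are the sample rows and the optional `init` parameter, which exists only to make the example deterministic.

```python
import random


def matching_distance(a, b):
    """Share of categorical features on which two rows differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def assign(dist, medoids):
    """Step 2: index of the closest medoid for every observation."""
    return [
        min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
        for i in range(len(dist))
    ]


def k_medoids(dist, k, init=None, seed=0, max_iter=100):
    """Follow the listed steps on a precomputed distance matrix."""
    n = len(dist)
    # Step 1: choose k random observations as medoids (unless given).
    medoids = list(init) if init is not None else random.Random(seed).sample(range(n), k)
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # keep the old medoid if a cluster is empty
                new_medoids.append(medoids[c])
                continue
            # Step 3: the member with the lowest total distance to the
            # other members becomes the medoid.
            new_medoids.append(min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if new_medoids == medoids:  # Step 4: stop when nothing changed
            break
        medoids = new_medoids
    return medoids, assign(dist, medoids)


# Hypothetical categorical rows: (genre, platform, day of week).
rows = [
    ("Drama", "OTT", "MON"), ("Drama", "OTT", "MON"), ("Drama", "OTT", "TUE"),
    ("Sports", "LIVE", "SUN"), ("Sports", "LIVE", "SUN"), ("Sports", "LIVE", "SAT"),
]
dist = [[matching_distance(a, b) for b in rows] for a in rows]
medoids, labels = k_medoids(dist, k=2, init=[0, 3])
print(medoids, labels)
```

The distance matrix has one entry per pair of observations, so its size grows with the square of the number of rows. That quadratic growth is the likely reason this kind of approach becomes expensive on a dataset with hundreds of thousands of records.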

It was so time and memory consuming that I couldn’t run it locally. I also tried running it on an AWS free instance, but it failed there as well because it ran out of memory.

  2. I also tried subsetting the data to only the categorical features and then applied kmodes, the K-Modes clustering algorithm. I didn’t run into running time or memory issues, but the clustering results were not satisfactory because there are many clear overlaps among clusters.

Other thoughts on audience segmentation


1. Audience Info

If we had viewing data along with audiences’ demographic and psychographic profiles, we would be able to better picture our audiences and therefore build a more accurate audience segmentation model. For example, if two people B and C have the same gender, age level, and occupation, as well as some overlapping TV program subscriptions, it is likely that they share similar tastes in TV shows. A TV show that B likes might also be interesting to C and vice versa. Therefore, based on user similarities, we can tailor content recommendations for users.


2. Social Media Data

  • Social media data can also help with audience segmentation. For example, if we know that person A liked a Facebook page about a sitcom that is about to air LIVE next month, it is likely that A likes this sitcom and wants to watch the show. Therefore, we can send A relevant offers and advertisements.

  • Another use of social media data is sentiment analysis on users’ comments about TV programs to see their opinions on them. If we have enough review data, we can learn each person’s or group’s preferences and therefore recommend tailored content.


3. Purchasing Behavior Data

If we had users’ purchasing behavior data, we could estimate how much an audience member is worth to our company by calculating customer lifetime value (CLV) using the Recency-Frequency-Monetary (RFM) framework. We could then cluster users by their CLV and loyalty and create targeted marketing plans (ads, campaigns, etc.) for different user groups.
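As a rough illustration of the Recency-Frequency-Monetary idea, the sketch below computes the three raw values per user from a purchase log. The purchase records, the reference date, and the record layout are all hypothetical, since no purchasing data is available in this project. This covers only the raw RFM values and is not a full CLV model.

```python
from collections import defaultdict
from datetime import date


def rfm(purchases, as_of):
    """Return {user: (days since last purchase, purchase count, total spend)}."""
    last = {}
    frequency = defaultdict(int)
    monetary = defaultdict(float)
    for user_id, day, amount in purchases:
        last[user_id] = max(last.get(user_id, day), day)
        frequency[user_id] += 1
        monetary[user_id] += amount
    return {
        user: ((as_of - last[user]).days, frequency[user], monetary[user])
        for user in last
    }


# Hypothetical (user_id, purchase_date, amount) records.
purchases = [
    ("a", date(2017, 1, 1), 10.0),
    ("a", date(2017, 1, 20), 5.0),
    ("b", date(2016, 12, 1), 50.0),
]
scores = rfm(purchases, as_of=date(2017, 1, 31))
for user, (recency, freq, spend) in scores.items():
    print(f"{user}: recency={recency}d frequency={freq} monetary={spend:.2f}")
```

Users could then be ranked or binned on each of the three values before clustering, for example to separate recent, frequent buyers from lapsed ones.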


What I would do if more time is given…


  1. Since I have found clusters of users who share similar watching behavior and content consumption in the first part of Audience Segmentation, the next step is to see whether these users are also similar in terms of App usage and Website usage.

  2. Since one user may have multiple devices, they may prefer different devices for different tasks. For example, a phone may be used mainly for communication, while a tablet may be used more for entertainment such as gaming and videos. Cross-device tracking enables us to get more accurate statistics and more comprehensive information about users.